12 research outputs found

    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Full text link
    Speechreading, or lipreading, is the technique of inferring phonetic content from a speaker's visual cues, such as the movements of the lips, face, teeth, and tongue. It has a wide range of multimedia applications, for example in surveillance, Internet telephony, and as an aid for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has begun to venture into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, even when multiple camera feeds of the same speaker are available, they have not been exploited to handle these different poses. To this end, this paper presents the first multi-view speechreading and reconstruction system. This work pushes the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system. The work further identifies the placement of cameras that leads to the maximum intelligibility of speech. Finally, it lays out various innovative applications for the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems. Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul, Republic of Korea
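
    The abstract does not describe the fusion architecture itself. Purely as an illustration of the core idea (encoding each camera view separately and fusing the embeddings before decoding an audio representation), here is a minimal, hypothetical sketch in Python; the module names, feature shapes, and mean-pooling fusion are assumptions for illustration, not the authors' design.

        # Hypothetical multi-view fusion sketch (not the paper's model).
        # Each camera view is encoded independently by a shared GRU; the
        # fused embedding is decoded into a mel-spectrogram-like output.
        import torch
        import torch.nn as nn

        class MultiViewSpeechReconstructor(nn.Module):
            def __init__(self, feat_dim=512, hidden=256, n_mels=80):
                super().__init__()
                # Shared encoder over per-view visual features of shape
                # (batch, time, feat_dim), e.g. from a lip-region CNN.
                self.view_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
                # Maps the fused representation to n_mels spectrogram bins.
                self.decoder = nn.Linear(hidden, n_mels)

            def forward(self, views):
                # views: list of (batch, time, feat_dim) tensors, one per camera.
                encoded = [self.view_encoder(v)[0] for v in views]
                fused = torch.stack(encoded).mean(dim=0)  # simple mean fusion
                return self.decoder(fused)                # (batch, time, n_mels)

        # Two synthetic camera views, 100 frames each:
        model = MultiViewSpeechReconstructor()
        views = [torch.randn(1, 100, 512) for _ in range(2)]
        mel = model(views)  # shape: (1, 100, 80)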

    The role of aural frequency analysis in pitch perception with simultaneous complex tones

    No full text
    Pitch perception has long been a central issue in the psychoacoustic literature. In particular, the problem of complex-tone pitch, which does not simply depend on any single spectral frequency, has been the object of much interest during the past century. Since Seebeck (1841) discovered that upper partials contribute significantly to the pitch of complex tones, several mechanisms have been proposed, such as nonlinear distortion creating a difference tone (Helmholtz, 1863; Fletcher, 1924), interference between unresolved partials causing a periodic envelope pattern (Schouten, 1940; Plomp, 1967), or some form of central neural processing (Goldstein, 1973; Wightman, 1973; Terhardt, 1972). Most modern pitch theories agree that the pitch of a complex tone is directly or indirectly derived from spectral frequencies which are resolved in the cochlea.
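
    As a concrete illustration of that last point, the sketch below estimates pitch from a set of resolved partial frequencies by harmonic template matching, a simplification in the spirit of pattern-matching theories (Goldstein, 1973; Terhardt, 1972) rather than any specific published model. Partials at 800, 1000, and 1200 Hz yield a pitch near 200 Hz even though no 200 Hz component is present (the classic missing fundamental).

        # Minimal harmonic-template pitch estimate from resolved partials.
        import numpy as np

        def estimate_pitch(partials_hz, f0_range=(50.0, 500.0), step=0.5, tol=1e-3):
            partials = np.asarray(partials_hz, dtype=float)
            # Search from high to low f0 so that subharmonics (which also
            # fit the harmonic template) do not shadow the true fundamental.
            best_f0, best_err = None, np.inf
            for f0 in np.arange(f0_range[1], f0_range[0], -step):
                n = np.maximum(np.round(partials / f0), 1.0)  # nearest harmonic numbers
                err = np.mean(np.abs(partials - n * f0) / partials)
                if err < tol:
                    return f0
                if err < best_err:
                    best_f0, best_err = f0, err
            return best_f0

        print(estimate_pitch([800, 1000, 1200]))  # -> 200.0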

    Quantifying sound quality in loudspeaker reproduction

    No full text
    We present PREQUEL: Perceptual Reproduction Quality Evaluation for Loudspeakers. Instead of quantifying the loudspeaker system itself, PREQUEL quantifies the loudspeakers' overall perceived sound quality by assessing their acoustic output using a set of music signals. This approach introduces a major problem: subjects cannot be provided with an acoustic reference signal, so their judgment is based on an unknown, internal reference. However, an objective perceptual assessment algorithm needs a reference signal in order to predict the perceived sound quality. In this paper, these reference signals are created by making binaural recordings with a head-and-torso simulator, using the best-quality loudspeakers, in the ideal listening spot, in the best-quality listening environment. The reproduced reference signal with the highest subjective quality is compared to the acoustically degraded loudspeaker output. PREQUEL is developed and subsequently validated using three databases that contain binaurally recorded music fragments played over low- to high-quality loudspeakers in low- to high-quality listening rooms. The model shows a high average correlation (0.85) between objective and subjective measurements. PREQUEL thus allows prediction of the subjectively perceived sound quality of loudspeakers, taking into account the influence of the listening room and the listening position.
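
    The abstract does not disclose PREQUEL's internal metric. As a generic illustration of the full-reference pattern it follows (compare the binaural reference recording to the degraded loudspeaker recording and map the distance to a quality score), here is a minimal sketch; the log-spectral distance and the linear score mapping are illustrative stand-ins, not PREQUEL's actual algorithm.

        # Illustrative full-reference comparison (not PREQUEL's actual metric):
        # frame both recordings, compute a mean log-spectral distance, and map
        # it onto a 1-5 quality scale with an assumed linear calibration.
        import numpy as np

        def log_spectral_distance(ref, deg, frame=1024, hop=512, eps=1e-9):
            n = min(len(ref), len(deg))
            dists = []
            for start in range(0, n - frame, hop):
                r = np.abs(np.fft.rfft(ref[start:start + frame])) + eps
                d = np.abs(np.fft.rfft(deg[start:start + frame])) + eps
                dists.append(np.sqrt(np.mean((20 * np.log10(r / d)) ** 2)))
            return float(np.mean(dists))

        def predict_quality(ref, deg, slope=-0.15, intercept=5.0):
            # Hypothetical calibration: larger distance -> lower quality score.
            return float(np.clip(intercept + slope * log_spectral_distance(ref, deg), 1.0, 5.0))

        fs = 48000
        t = np.arange(fs) / fs
        reference = np.sin(2 * np.pi * 440 * t)                  # ideal reproduction
        degraded = 0.8 * reference + 0.05 * np.random.randn(fs)  # simulated loudspeaker output
        print(predict_quality(reference, degraded))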

    Parameter-based speech quality measures for GSM

    No full text

    Enhancing the Quality of Service of mobile video technology by increasing multimodal synergy

    Get PDF
    Bandwidth is still a limiting factor for the Quality of Service (QoS) of mobile communication applications. In particular, for Voice over IP the QoS is not yet as good as for common, well-engineered, public-switched telephone networks. Multisensory communication has been identified as a possible way to moderate this limitation. One of the strengths of mobile video technology lies in its combination of visual and auditory modalities. However, one of the most salient features of mobile video applications is their small screen size. To test the potential of multimodal synergy for mobile devices, we assessed to what extent small screens affect multimodal synergy. This potential was assessed in an experiment with 54 participants, who conducted a standardised video-listening test for three talking-head videos with a signal-to-noise ratio of –9 dB. The videos were presented on three different screen sizes, whilst keeping the video and auditory signals equal. Compared to a ground truth based on 359 participants, intelligibility was found to be significantly higher when using a large screen than when using a small screen. This indicates that mobile video technology has the potential for significant multimodal synergy, to which screen size is a substantial constraint. To optimally benefit from this multimodal potential, we offer suggestions on how to increase the effective screen size for small-screen (e.g. mobile) devices and applications by emphasising the most relevant (visual) features. We conclude that knowledge about human sensory processing can alleviate the identified constraint and maximise the potential QoS of mobile video technology.
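
    The abstract specifies the acoustic condition (a signal-to-noise ratio of –9 dB) but not how such stimuli are constructed. A standard approach, sketched below under the assumption of simple additive noise, is to scale the noise so the speech-to-noise power ratio hits the target.

        # Mix speech with noise at a target SNR (here -9 dB, as in the
        # study's listening test). The noise is scaled so that
        # 10*log10(P_speech / P_noise) equals the requested SNR.
        import numpy as np

        def mix_at_snr(speech, noise, snr_db):
            noise = noise[:len(speech)]
            p_speech = np.mean(speech ** 2)
            p_noise = np.mean(noise ** 2)
            scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
            return speech + scale * noise

        rng = np.random.default_rng(0)
        speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)  # stand-in for speech
        mixture = mix_at_snr(speech, rng.standard_normal(16000), snr_db=-9.0)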

    Perceptual Evaluation of Speech Quality (PESQ) -- A New Method for Speech Quality Assessment of Telephone Networks and Codecs

    No full text
    Previous objective speech quality assessment models, such as Bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay.
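
    PESQ is standardised as ITU-T P.862. For experimentation, one commonly used implementation is the third-party pesq package on PyPI (a tooling assumption, not something the abstract mentions); the file names below are placeholders for real recordings.

        # Scoring a degraded recording against its clean reference with the
        # third-party `pesq` package (pip install pesq), which wraps the
        # ITU-T P.862 reference code. File names are placeholders.
        import soundfile as sf
        from pesq import pesq

        reference, fs = sf.read("clean_speech.wav")  # hypothetical 16 kHz reference
        degraded, _ = sf.read("coded_speech.wav")    # same utterance after a codec

        # 'wb' selects the wideband P.862.2 mode (requires fs == 16000);
        # the result is a MOS-LQO score, roughly 1 (bad) to 4.5+ (excellent).
        score = pesq(fs, reference, degraded, 'wb')
        print(f"PESQ MOS-LQO: {score:.2f}")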